Abstract

Stimuli outside classical receptive fields significantly influence the neurons’ activities in 
primary visual cortex [1, 2, 3, 4, 5]. We propose that such contextual influences are used 
to segment regions by detecting the breakdown of homogeneity or translation invariance 
in the input, thus computing global region boundaries using local interactions. This is 
implemented in a biologically based model of V1, and demonstrated in examples of texture
segmentation and figure-ground segregation. By contrast with traditional approaches, 
segmentation occurs without classification or comparison of features within or between 
regions and is performed by exactly the same neural circuit responsible for the dual problem 
of the grouping and enhancement of contours. 
Recent experiments have pointed to the complexity of processing that occurs in V1
[6, 7, 8, 9, 3]. Not only can this processing determine the gains and the classical
tuning functions of cells [6, 9, 10], but it also arranges for contextual influences on
their activities from stimuli beyond their classical receptive fields (RFs)
[1, 2, 3, 11, 4, 12, 13, 5]. The responses of cells depend on whether stimuli within and
beyond the RFs share the same orientations [2, 4, 11, 5], and whether the stimuli within
the RFs are part of different regions, such as figure or ground [12, 13]. Horizontal
intra-cortical connections are suggested to mediate the contextual influences [7, 3].
While there has been substantial experimental interest and some modeling interest
(e.g., [14]) in these contextual influences, computational understanding of their roles
in visual processing is lagging far behind [1, 3].

We propose that the contextual influences in the primary visual cortex can serve the
goal of visual grouping, i.e., inferring global visual objects such as contours and
regions from the local features captured by the RFs. Local features can group into
regions, as in texture segmentation, or into contours, which may represent boundaries of
underlying objects. We show how one form of global grouping, namely region segmentation,
can emerge from a simple but biologically based model of V1 which only involves
finite-range cortical interactions.

It has always been assumed, implicitly or explicitly, that to segment one region from
another, feature extraction and/or classification within a region and feature comparison
between regions are required [15, 16, 17]. On the other hand, feature extraction or
classification often requires segmentation, thus creating a dilemma. In these
traditional approaches, not only is feature classification problematic near the
boundaries between regions, but segmentation using feature comparison is also tricky in
cases such as figure (3)D, where the two regions have the same texture feature value but
are segmentable in natural vision. Therefore, feature extraction or classification is
neither always necessary nor sufficient for segmentation. In fact, even with
distinguishable classification flags for all image areas in any two regions,
segmentation is not completed until another processing step locates the boundary,
perhaps by searching for where the classification flags change. Therefore, we propose
that segmentation in its pre-attentive stage is segmentation without classification,
i.e., segmentation without explicitly knowing the contents of the regions. This
simplifies the segmentation process conceptually, making it feasible by low-level
processing in V1. This paper focuses on this pre-attentive segmentation. Additional
processing is likely needed to improve the outcome based on pre-attentive segmentation,
e.g., by filling in the contents of the regions.

The model focuses on simple texture segmentation, i.e., region grouping without color,
motion, luminance, or stereo cues. A single texture region is defined by the homogeneity
or translation invariance of the statistics of the input features that define it, no
matter what features are involved or, for instance, whether or not they are textons [18].
If cortical interactions are translation invariant and do not induce spontaneous pattern
formation (such as zebra stripes [19]) through the spontaneous breakdown of translation
symmetry, then the cortical response to a homogeneous region will itself be homogeneous.
However, homogeneity is disrupted at the boundary of a region. Consequently, a neuron
near the boundary and another far from the boundary experience different contextual
influences, and thus exhibit different response levels. The location of the boundary can
therefore be pinpointed by assessing where the contextual influences or neural response
levels change. In the model, this breakdown in homogeneity gives relatively higher
neural activities near the boundaries than away from them. This makes the boundaries
relatively more salient, allowing them to pop out perceptually. Physiological
experiments in V1 indeed show that activity levels are higher near texture
boundaries [20].

Figure (1) shows the elements of the model 
and their interactions. Based on experimental
observations [8, 9], a cortical column is modelled
by recurrently connected excitatory cells and inhibitory interneurons tuned to bars or
edges. Quantities $x_{i\theta}$ and $y_{i\theta}$ are the membrane potentials of the
excitatory and inhibitory cells having the RF center (or hypercolumn) location $i$ and
preferred orientation $\theta$. The excitatory cell receives external visual input
$I_{i\theta}$ to the cortical cell, which is the retinal image filtered through the RF.
These edge or bar inputs to the model are merely image primitives, which are in
principle like the image pixel primitives and are reversibly convertible from them. They
do not denote the texture feature values, e.g., the ‘+’ or ‘x’ patterns and their
spatial arrangements in the example of figure (3)C. Again, this model does not extract
texture features in order to segment. The output from V1 is provided by the excitatory
cells. Based on observations by Gilbert, Lund and their colleagues [7, 3], horizontal
connections $J_{i\theta,j\theta'}$ and $W_{i\theta,j\theta'}$ link cells with different
RF centers and similar orientation preferences to mediate contextual influences. The
membrane potentials follow the equations:
where $\alpha_x x_{i\theta}$ and $\alpha_y y_{i\theta}$ model the decay to resting
potentials, $g_x(x)$ and $g_y(y)$ are sigmoid-like functions modeling cells’ firing
rates given membrane potentials $x$ and $y$, respectively, $\psi(\Delta\theta)$ is the
inhibition spread within a hypercolumn, $J_o g_x(x_{i\theta})$ is the self excitation,
$I_c$ and $I_o$ are background inputs or inputs modeling the general and local activity
normalization [21], and $J_{i\theta,j\theta'} g_x(x_{j\theta'})$ and
$W_{i\theta,j\theta'} g_x(x_{j\theta'})$ model the contextual influences (see [22, 23]
for more details).
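
For reference, the terms described above can be assembled into a pair of coupled
equations, one for each cell type. The form below is a reconstruction from that
term-by-term description; the summation ranges and the exact placement of $I_o$ and
$I_c$ are assumptions, so it should be read as a sketch rather than a verbatim display.

```latex
\begin{aligned}
\dot{x}_{i\theta} &= -\alpha_x x_{i\theta}
  - \sum_{\Delta\theta} \psi(\Delta\theta)\, g_y(y_{i,\theta+\Delta\theta})
  + J_o\, g_x(x_{i\theta})
  + \sum_{j \neq i,\,\theta'} J_{i\theta,j\theta'}\, g_x(x_{j\theta'})
  + I_{i\theta} + I_o \\
\dot{y}_{i\theta} &= -\alpha_y y_{i\theta}
  + g_x(x_{i\theta})
  + \sum_{j \neq i,\,\theta'} W_{i\theta,j\theta'}\, g_x(x_{j\theta'})
  + I_c
\end{aligned}
```

In this reading, $J$ acts directly on the excitatory cell (monosynaptic excitation),
while $W$ drives the inhibitory interneuron, which in turn suppresses the excitatory
cell (disynaptic inhibition).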

The activity levels of the neurons $g_x(x_{i\theta})$ are initially set by just the
visual input $I_{i\theta}$. This input persists after its onset. The activities are then
modified effectively within one membrane time constant by the cortical interactions that
mediate the contextual influences. Mean field techniques and dynamic stability analysis
are used to design the horizontal connections $J$ and $W$ to ensure that: (1) the system
does not generate patterns spontaneously, i.e., the model gives spatially homogeneous
output for homogeneous input images, (2) the region boundaries are relatively
highlighted by modeling the physiologically observed iso-orientation suppression via the
contextual influences (thereby making areas inside a region less salient), and (3) the
same neural circuit performs contour enhancement (see [24] for more details).
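
The boundary-highlighting effect of iso-orientation suppression can be illustrated
with a deliberately reduced sketch. It is not the model used in this paper: it assumes
a 1-D ring of locations, a single orientation, no excitatory $J$ connections, no
background inputs, piecewise-linear firing rates, and hypothetical parameter values
(`w`, `j0`, `drive`, `radius`) chosen only so that the dynamics settle. Each
excitatory/inhibitory pair is integrated with a simple Euler step, and temporal
averages are taken as the output.

```python
import numpy as np


def simulate_ring(n=40, region=20, radius=3, w=0.2, j0=0.5,
                  drive=2.0, dt=0.05, steps=800):
    """Toy 1-D ring of excitatory/inhibitory pairs tuned to one orientation.

    All parameter values are illustrative assumptions, not the paper's.
    Returns the time-averaged excitatory output g_x(x_i) for each location.
    """
    def g_x(x):
        # Piecewise-linear stand-in for the sigmoid-like excitatory firing rate.
        return np.clip(x - 1.0, 0.0, 1.0)

    def g_y(y):
        # Rectified-linear firing rate of the inhibitory interneuron.
        return np.maximum(y, 0.0)

    # Translation-invariant suppressive connections W on a ring (wrap-around).
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    dist = np.minimum(dist, n - dist)
    W = np.where((dist > 0) & (dist <= radius), w, 0.0)

    # Equal-strength input inside the region, no input outside it.
    I = np.zeros(n)
    I[:region] = drive

    x = np.zeros(n)
    y = np.zeros(n)
    acc = np.zeros(n)
    half = steps // 2
    for t in range(steps):
        a = g_x(x)
        dx = -x - g_y(y) + j0 * a + I   # decay, local inhibition, self excitation, input
        dy = -y + a + W @ a             # interneuron driven locally and via W
        x = x + dt * dx
        y = y + dt * dy
        if t >= half:
            acc += g_x(x)               # average over the second half of the run
    return acc / (steps - half)


saliency = simulate_ring()
```

With these settings the input occupies locations 0 to 19. Units at the two ends of that
stretch have fewer active neighbours, so they receive less suppression and settle at a
higher output than units in the middle, while units that receive no input stay silent.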

The model was applied to a variety of textured inputs. Figure (2)A shows a sample input
consisting of two regions, in which all the visible inputs $I_{i\theta}$ have the same
strength. Figure (2)B, C shows the output of the model, indicating that the activities
of the neurons at the boundary are significantly higher than others. Figure (2)D
confirms that the boundary can be identified by thresholding the final activities.
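
The thresholding step can be sketched as follows. The text does not state how the
threshold is chosen, so the rule used here (keep every location whose output reaches a
fixed fraction of the maximum) and the names `highlight` and `fraction` are
assumptions made for illustration only.

```python
import numpy as np


def highlight(saliency, fraction=0.8):
    """Return indices whose output is at least `fraction` of the maximum output.

    The relative-threshold rule is an illustrative assumption.
    """
    s = np.asarray(saliency, dtype=float)
    return np.flatnonzero(s >= fraction * s.max())


# Hypothetical output strengths along one row: two neighbouring units stand out.
row = [0.37, 0.37, 0.50, 0.52, 0.37]
boundary = highlight(row)
```

A relative threshold keeps the rule independent of the overall input strength; an
absolute threshold would be an equally plausible choice.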

Figure (3) shows other examples of input patterns and the thresholded outputs of the
model. Note in particular: figures (3)A, B, C show that the model copes well with
textures defined by complex or stochastic patterns; figure (3)D shows that it segments
regions by detecting the breakdown of homogeneity even though the two regions have the
same texture feature, a feat difficult in traditional approaches; figure (3)E shows that
both humans and the model have difficulty segmenting regions when the translation
invariance is only broken very weakly; and figure (3)H shows that when a region is very
small, all parts of it belong to the boundary and it pops out from the background.
Figures (3)F, G show other examples where regions differ by the orientations of the
texture elements. Finally, figure (3)I confirms that exactly the same model, with the
same elements and parameters, can also highlight contours against a noisy background.
This can be seen as another example of a breakdown of translation invariance. Additional
simulations confirm that the model also performs reasonably well on many other examples.

Our model for detecting region boundaries goes beyond, and differs from, early visual
processing using center-surround filters or the like [25]. There, the filters are tuned
to detect contrast in luminance; they can detect the edge primitives in a textured
region, and their outputs can be used as inputs to our model. However, these filters
cannot detect feature changes from one region to another, e.g., in figure (2)A, that are
not apparent in average luminance changes. If one were to design a one-stage filter to
detect the feature changes between regions, the filter would be feature specific, and
many different kinds would be required to cover the many possible region differences.
The mechanism using cortical interactions in our model highlights conspicuous image
locations or general feature changes from one region to another without specific tuning
to any region features. While the early stage filters code image primitives [25], the
mechanism in our model is aimed towards coding object surface primitives.

It has recently been argued that texture analysis is performed at a low level of visual
processing [15], and indeed filter-based models [16] and their non-linear extensions
(e.g., [17]) capture much of the phenomenology of psychophysical performance well.
However, all the previous models are based on the traditional approach of segmentation
by feature classification/comparison, and thus share the problems associated with that
approach. By performing segmentation without classification, our model differs from
these in principle. Consequently, while our model employs only those low-level visual
operations that are consistent with experimental observations [7, 8, 9, 3], the model by
Malik and Perona [17], for instance, uses complicated forms of cortical interactions
such as winner-take-all operations and spatial derivatives, for which there exists
little experimental evidence. In addition, our model is the first to perform region
segmentation and contour enhancement using exactly the same neural circuit. This is
desirable since regions and their boundary contours are complementary to each other.
Furthermore, in our framework, small regions naturally pop out, as in figure (3)H;
filling-in in a non-homogeneous region would be the perceptual consequence of the
model’s failing to highlight the non-homogeneity; and feature statistics in a
region [26] are automatically accounted for in region segmentation.

The components of the model and its behavior are consistent with experimental evidence
[7, 8, 9, 3, 20]. However, the model is obviously only an approximation to the true
complexities of V1. For instance, all its elements are tuned to one scale, and exhibit
none of the flexible adaptation that is pervasive in the real system. Therefore, the
model sometimes finds it easier or more difficult to segment some regions than natural
vision does; for instance, it does not cope well with gradual changes in images caused
by the tilt of a textured surface. Any given neural interaction will be more sensitive
to some region differences than others. Hence, a more detailed model of the neural
elements and the connection pattern would be required to capture exactly the
psychophysical data on segmentation in natural pre-attentive vision. However,
independent of such details, our results show the feasibility of the underlying ideas:
that region segmentation can occur without region classification, that breakdown of
translation invariance can be used to segment regions, that region segmentation and
contour detection can be addressed by the same mechanism, and that low-level processing
in V1 together with local contextual interactions can contribute significantly to visual
computations at global scales.
[Figure 1 panels. A: a sampling location and one of the edge detectors. B: neural
connection pattern (solid: J, dashed: W). C: model neural elements, with visual inputs
to excitatory cells and edge outputs to higher visual areas.]

Figure 1: A: Visual inputs are sampled in a discrete grid by edge/bar detectors,
referred to as edges or edge segments, modeling RFs in V1. Each grid point has K neuron
pairs (see C), one per edge segment. All cells at a grid point share the same RF center,
but are tuned to different orientations spanning 180°, thus modeling a hypercolumn. An
edge segment in one hypercolumn can interact with another in a different hypercolumn via
monosynaptic excitation J (the solid arrow from one thick bar to another), or disynaptic
inhibition W (the dashed arrow to a thick dashed bar). See also C. B: A schematic of the
neural connection pattern from the center (thick solid) edge to neighboring edges within
a finite distance. J’s contacts are shown by thin solid edges. W’s are shown by thin
dashed edges. All edges have the same connection pattern, suitably translated and
rotated from this one. C: An input edge segment is associated with an interconnected
pair of excitatory and inhibitory cells; each model cell abstractly models a local group
of cells of the same type. The excitatory cell receives visual input and sends output
$g_x(x_{i\theta})$ to higher centers. The inhibitory cell is an interneuron. Activity
levels $g_x(x_{i\theta})$ often oscillate over time [27, 28], which is an intrinsic
property of a population of recurrently connected excitatory and inhibitory cells.
Temporal averages over multiple time constants after input onset are taken as the model
output. The region dependence of the phases of the oscillations in this model could be
exploited for segmentation [22], although this is beyond the scope of this paper. The
visual space has toroidal (wrap-around) boundary conditions.
[Figure 2 panel titles. A: Input image to model. C: Neural response levels for one of
the rows. D: Thresholded model output.]

Figure 2: A: Input $I_{i\theta}$ of two regions; each visible edge has the same input
strength. B: Model output for A, showing non-uniform output strengths (temporal averages
of $g_x(x_{i\theta})$) for the edges. The input and output edge strengths are
proportional to the edge thicknesses shown. C: Output strengths (saliencies) vs. lateral
locations of the edges for a row like the bottom row in B, with the bar lengths
proportional to the corresponding edge output strengths. D: The thresholded output from
B for illustration. Each plotted region shown here is actually a small part of, and
extends continuously to, a larger image. The same format is used in other figures in
this paper.

Figure 3: Additional examples A, B, C, D, E,
F, G, H, and I of model input images, each 
followed by the corresponding output highlights 
immediately below it. 